fix: worker backlog by johnyeocx · Pull Request #873 · useautumn/autumn

johnyeocx · 2026-03-03T17:33:06Z

Summary by cubic

Prevents SQS worker backlog by timing out stuck jobs, running migration jobs without blocking the poller, and recycling/respawning workers more aggressively. Also stabilizes a billing migration test with a fixed invoice count.

Bug Fixes
- Enforce 25s per-message processing timeout (below 30s visibility); migration jobs are exempt.
- Split migration jobs from the poll loop and run them fire-and-forget; log and send failures to Sentry.
- Add idle self-kill after 5 zero-message intervals (only after some work done) to trigger cluster respawn.
- Lower recycle threshold from 500k to 50k messages to reduce memory buildup.
- Update migrate-paid test to assert a fixed invoice count (2) for stability.

^{Written for commit 29b208b. Summary will update on new commits.}

Greptile Summary

This PR addresses worker backlog issues by introducing idle self-kill, per-message timeouts, and fire-and-forget dispatch for migrations. The recycle threshold is also reduced from 500k to 50k messages.

Key changes:

Bug fixes — Reduced MAX_MESSAGES_BEFORE_RECYCLE from 500,000 → 50,000 to recycle workers more aggressively and prevent memory buildup.
Improvements — Added IDLE_SELF_KILL_THRESHOLD (5 intervals): workers exit cleanly after 5 minutes of zero messages, allowing cluster respawn.
Improvements — Added MESSAGE_TIMEOUT_MS = 25_000 and withTimeout helper: non-migration messages that stall beyond 25s raise an error and release the polling loop rather than blocking indefinitely.
Improvements — Migration jobs are now dispatched fire-and-forget, preventing them from blocking the regular message batch since they're already deleted from the queue before processing.
Bug fixes — Updated migrate-paid-2 test to hardcode count: 2 instead of dynamically capturing invoiceCountBefore.

Issues identified:

The idle self-kill logic can terminate the worker while long-running fire-and-forget migration jobs are still executing in the background, since metrics are only updated upon job completion.
The test change to hardcoded invoice count weakens the "no new invoice created" invariant, making it inconsistent with sibling tests that use dynamic capture.

Confidence Score: 2/5

PR has two issues that reduce safety: a production correctness race condition in the idle self-kill logic with fire-and-forget migrations, and a test quality regression that weakens the migration invariant check.
The idle self-kill mechanism can trigger while long-running migration jobs are still executing in the background, potentially killing workers mid-operation. Additionally, the test change to hardcoded invoice counts weakens the critical invariant check and is inconsistent with sibling tests. Both issues should be addressed before merge.
server/src/queue/initWorkers.ts (production correctness), server/tests/integration/billing/migrations/migrate-paid.test.ts (test quality regression)

Sequence Diagram

sequenceDiagram
    participant PL as Polling Loop
    participant Q as SQS Queue
    participant H as handleSingleMessage
    participant P as processMessage

    PL->>Q: ReceiveMessageCommand (up to 10 msgs)
    Q-->>PL: messages[]

    loop For each message
        PL->>PL: Parse job.name
        alt job.name === Migration
            PL-->>H: fire-and-forget (no await)
            H->>Q: DeleteMessageCommand (immediately)
            H->>P: processMessage (no timeout)
            Note over H,P: Long-running, already deleted from queue
        else Regular message
            PL->>PL: push to regularMessages[]
        end
    end

    PL->>H: Promise.allSettled(regularMessages)
    H->>P: withTimeout(processMessage, 25s)
    P-->>H: result
    H-->>PL: {id, receiptHandle} | null

    PL->>Q: DeleteMessageBatchCommand(toDelete)

    Note over PL: Every 60s: logStatsAndCheckZeroMessages()
    alt messagesProcessed == 0 for 5 intervals AND totalProcessed > 0
        PL->>PL: process.exit(0) [⚠ RACE: may kill while migrations running]
    end
    alt totalMessagesProcessed >= 50,000
        PL->>PL: process.exit(0) [recycle]
    end

_{Last reviewed commit: 29b208b}

Context used:

Rule from dashboard - CLAUDE.md (source)
Context from dashboard - When generating the key changes section of the summary, please tag each change with one or many of t... (source)

vercel · 2026-03-03T17:33:08Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
autumn-vite	Ready	Preview, Comment	Mar 3, 2026 5:34pm

cubic-dev-ai

1 issue found across 2 files

Confidence score: 4/5

This PR is likely safe to merge, with only a small test-quality risk rather than a production behavior issue.
In server/tests/integration/billing/migrations/migrate-paid.test.ts, hardcoding invoice count to 2 can miss regressions in “no new invoice created” logic if fixture/setup counts change, making the test brittle.
Severity is moderate-low (4/10) and scoped to test robustness, so merge risk is minimal but worth tightening soon by comparing against a captured pre-migration count.
Pay close attention to server/tests/integration/billing/migrations/migrate-paid.test.ts - the fixed expected count can hide migration regressions when baseline data changes.

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/tests/integration/billing/migrations/migrate-paid.test.ts">

<violation number="1" location="server/tests/integration/billing/migrations/migrate-paid.test.ts:204">
P2: Hardcoding the invoice count to 2 no longer verifies “no new invoice created” and makes the test brittle if the initial invoice count changes. Capture the pre-migration invoice count and compare against it instead.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

cubic-dev-ai · 2026-03-03T17:35:53Z

server/tests/integration/billing/migrations/migrate-paid.test.ts

 	await expectCustomerInvoiceCorrect({
 		customer,
-		count: invoiceCountBefore,
+		count: 2,


P2: Hardcoding the invoice count to 2 no longer verifies “no new invoice created” and makes the test brittle if the initial invoice count changes. Capture the pre-migration invoice count and compare against it instead.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/tests/integration/billing/migrations/migrate-paid.test.ts, line 204: <comment>Hardcoding the invoice count to 2 no longer verifies “no new invoice created” and makes the test brittle if the initial invoice count changes. Capture the pre-migration invoice count and compare against it instead.</comment> <file context> @@ -202,7 +201,7 @@ test.concurrent(`${chalk.yellowBright("migrate-paid-2: allocated seats with pric await expectCustomerInvoiceCorrect({ customer, - count: invoiceCountBefore, + count: 2, }); </file context>

greptile-apps

_{2 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-03-03T17:38:39Z

server/src/queue/initWorkers.ts

+		if (
+			consecutiveZeroMessageIntervals >= IDLE_SELF_KILL_THRESHOLD &&
+			totalMessagesProcessed > 0
+		) {
+			console.log(
+				`${logPrefix()} Idle self-kill: 0 messages for ${consecutiveZeroMessageIntervals} intervals after processing ${totalMessagesProcessed} total. Exiting for cluster respawn.`,
+			);
+			process.exit(0);
+		}


Idle self-kill can trigger while fire-and-forget migration jobs are still executing.

Migration jobs are dispatched with .catch() without being awaited (line 275), and messagesProcessed only increments when handleSingleMessage completes. If a migration job runs for 5+ minutes and no regular messages are processed in the interim, consecutiveZeroMessageIntervals reaches 5, causing process.exit(0) while the migration is still mid-flight.

Consider tracking active migration job count:

let activeMigrationJobs = 0; // When dispatching a migration job (line 275): activeMigrationJobs++; handleSingleMessage({ sqs, message, db }) .catch((error) => { console.error(...); Sentry.captureException(error); }) .finally(() => activeMigrationJobs--); // In idle self-kill check (line 88): if ( consecutiveZeroMessageIntervals >= IDLE_SELF_KILL_THRESHOLD && totalMessagesProcessed > 0 && activeMigrationJobs === 0 ) { // safe to exit }

Prompt To Fix With AI

This is a comment left during a code review. Path: server/src/queue/initWorkers.ts Line: 88-96 Comment: Idle self-kill can trigger while fire-and-forget migration jobs are still executing. Migration jobs are dispatched with `.catch()` without being awaited (line 275), and `messagesProcessed` only increments when `handleSingleMessage` completes. If a migration job runs for 5+ minutes and no regular messages are processed in the interim, `consecutiveZeroMessageIntervals` reaches 5, causing `process.exit(0)` while the migration is still mid-flight. Consider tracking active migration job count: ```ts let activeMigrationJobs = 0; // When dispatching a migration job (line 275): activeMigrationJobs++; handleSingleMessage({ sqs, message, db }) .catch((error) => { console.error(...); Sentry.captureException(error); }) .finally(() => activeMigrationJobs--); // In idle self-kill check (line 88): if ( consecutiveZeroMessageIntervals >= IDLE_SELF_KILL_THRESHOLD && totalMessagesProcessed > 0 && activeMigrationJobs === 0 ) { // safe to exit } ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-03-03T17:38:39Z

server/tests/integration/billing/migrations/migrate-paid.test.ts

 	// CRITICAL: No new invoice created
 	await expectCustomerInvoiceCorrect({
 		customer,
-		count: invoiceCountBefore,
+		count: 2,
 	});


Hardcoded invoice count weakens the migration invariant test.

Previously, this test captured invoiceCountBefore and asserted count: invoiceCountBefore, directly verifying "the migration does not create new invoices." By hardcoding count: 2, the test now only checks "there are exactly 2 invoices at this point" — a different guarantee that couples to test setup state.

This is inconsistent with migrate-paid-1 (line 110), migrate-paid-3 (line 298), and migrate-paid-4 (line 396), which all use the dynamic invoiceCountBefore pattern. If the test setup changes or produces an additional invoice earlier, the assertion will silently pass even if the migration itself created an extra invoice.

Consider reverting to the dynamic pattern to maintain the CRITICAL invariant:

Suggested change

// CRITICAL: No new invoice created

await expectCustomerInvoiceCorrect({

customer,

count: invoiceCountBefore,

count: 2,

});

// Verify initial state

let customer = await autumnV1.customers.get<ApiCustomerV3>(customerId);

const invoiceCountBefore = customer.invoices?.length ?? 0;

// ... later ...

await expectCustomerInvoiceCorrect({

customer,

count: invoiceCountBefore,

});

Prompt To Fix With AI

This is a comment left during a code review. Path: server/tests/integration/billing/migrations/migrate-paid.test.ts Line: 201-205 Comment: Hardcoded invoice count weakens the migration invariant test. Previously, this test captured `invoiceCountBefore` and asserted `count: invoiceCountBefore`, directly verifying "the migration does not create new invoices." By hardcoding `count: 2`, the test now only checks "there are exactly 2 invoices at this point" — a different guarantee that couples to test setup state. This is inconsistent with `migrate-paid-1` (line 110), `migrate-paid-3` (line 298), and `migrate-paid-4` (line 396), which all use the dynamic `invoiceCountBefore` pattern. If the test setup changes or produces an additional invoice earlier, the assertion will silently pass even if the migration itself created an extra invoice. Consider reverting to the dynamic pattern to maintain the CRITICAL invariant: ```suggestion // Verify initial state let customer = await autumnV1.customers.get<ApiCustomerV3>(customerId); const invoiceCountBefore = customer.invoices?.length ?? 0; // ... later ... await expectCustomerInvoiceCorrect({ customer, count: invoiceCountBefore, }); ``` How can I resolve this? If you propose a fix, please make it concise.

fix: worker timeout

29b208b

johnyeocx requested a review from ay-rod as a code owner March 3, 2026 17:33

johnyeocx merged commit 43d03ad into main Mar 3, 2026
8 of 11 checks passed

vercel bot deployed to Preview March 3, 2026 17:34 View deployment

cubic-dev-ai bot reviewed Mar 3, 2026

View reviewed changes

greptile-apps bot reviewed Mar 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: worker backlog#873

fix: worker backlog#873
johnyeocx merged 1 commit intomainfrom
fix/worker-backlog

johnyeocx commented Mar 3, 2026 •

edited by greptile-apps bot

Loading

Uh oh!

vercel bot commented Mar 3, 2026 •

edited

Loading

Uh oh!

Uh oh!

cubic-dev-ai bot left a comment

Uh oh!

cubic-dev-ai bot Mar 3, 2026 •

edited

Loading

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot Mar 3, 2026

Uh oh!

greptile-apps bot Mar 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

johnyeocx commented Mar 3, 2026 • edited by greptile-apps bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by cubic

Greptile Summary

Confidence Score: 2/5

Sequence Diagram

Uh oh!

vercel bot commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai bot Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

johnyeocx commented Mar 3, 2026 •

edited by greptile-apps bot

Loading

vercel bot commented Mar 3, 2026 •

edited

Loading

cubic-dev-ai bot Mar 3, 2026 •

edited

Loading